Learning Fast Classifiers for Image Spam

Authors

  • Mark Dredze
  • Reuven Gevaryahu
  • Ari Elias-Bachrach
Abstract

Recently, spammers have proliferated “image spam”: emails that contain the text of the spam message in a human-readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they classify spam images with accuracy in excess of 90% and up to 99% on real-world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high-accuracy features and a method to learn fast classifiers.
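The idea of Just in Time feature extraction with a decision tree can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, thresholds, tree shape, and the dictionary standing in for an image are all hypothetical, chosen only to show that a feature is computed the first time a tree node asks for it and never otherwise.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

# Assumption: an "image" is a plain dict of precomputed raw values.
# A real system would read these from the image file itself.
Image = Dict[str, float]
Extractor = Callable[[Image], float]


class LazyFeatures:
    """Computes a feature only when it is first requested, then caches it."""

    def __init__(self, image: Image, extractors: Dict[str, Extractor]):
        self._image = image
        self._extractors = extractors
        self._cache: Dict[str, float] = {}
        self.computed: List[str] = []  # features extracted so far, in order

    def get(self, name: str) -> float:
        if name not in self._cache:
            self._cache[name] = self._extractors[name](self._image)
            self.computed.append(name)
        return self._cache[name]


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    feature: str
    threshold: float
    low: Union["Node", Leaf]   # branch taken when value <= threshold
    high: Union["Node", Leaf]  # branch taken when value > threshold


def classify(tree: Union[Node, Leaf], features: LazyFeatures) -> str:
    """Walk the tree, extracting only the features on the path taken."""
    node = tree
    while isinstance(node, Node):
        value = features.get(node.feature)
        node = node.low if value <= node.threshold else node.high
    return node.label


# Hypothetical extractors: a cheap one first, a costly one deeper in the tree.
EXTRACTORS: Dict[str, Extractor] = {
    "file_size_kb": lambda img: img["bytes"] / 1024,
    "edge_density": lambda img: img["edges"],  # stands in for a slow computation
}

# Hypothetical tree: thresholds are illustrative, not taken from the paper.
TREE = Node(
    "file_size_kb", 10.0,
    low=Leaf("ham"),
    high=Node("edge_density", 0.3, low=Leaf("ham"), high=Leaf("spam")),
)

if __name__ == "__main__":
    features = LazyFeatures({"bytes": 4096.0, "edges": 0.9}, EXTRACTORS)
    print(classify(TREE, features), features.computed)
```

In this sketch a small image is resolved at the root, so the costlier `edge_density` extractor never runs. That is the source of the speed-up the abstract attributes to extracting features only as the classifier needs them.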

Similar resources

A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization

Spam is unwanted email that harms communication around the world. It is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...

Detecting Image Spam Using Image Texture Features

Filtering image email spam is considered a challenging problem because spammers keep modifying the images used in their campaigns with different obfuscation techniques, thereby preventing text recognition by Optical Character Recognition (OCR) tools and imposing additional challenges in filtering this type of spam. In this paper, we propose an image spam filtering tech...

A Discriminative Classifier Learning Approach to Image Modeling and Spam Image Identification

We propose a discriminative classifier learning approach to image modeling for spam image identification. We analyze a large number of images extracted from the SpamArchive spam corpora and identify four key spam image properties: color moment, color heterogeneity, conspicuousness, and self-similarity. These properties emerge from a large variety of spam images and are more robust than simply u...

An Incremental Learning Based Framework for Image Spam Filtering

Image spam remains an unsolved problem for two reasons. One is the diversity of spamming tricks. The other is the evolving nature of image spam: as new spam constantly emerges, filters’ effectiveness drops over time. In this paper, we present an effective anti-spam approach to solve the two problems. First, a novel clustering filter is proposed. By exploring...

Multi-View Learning for Web Spam Detection

Spam pages are designed to maliciously appear among the top search results by excessive usage of popular terms. Therefore, spam pages should be removed using an effective and efficient spam detection system. Previous methods for web spam classification used several features from various information sources (page contents, web graph, access logs, etc.) to detect web spam. In this paper, we follo...

Publication date: 2007